frequency band
Structured Spectral Reasoning for Frequency-Adaptive Multimodal Recommendation
Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking scheme with a representation-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance. The code is available at https://github.com/llm-ml/SSR.git.
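Stage (i) of the pipeline above can be illustrated without committing to a specific transform. The sketch below is an assumption, not the SSR implementation: the abstract does not say which graph-guided transformation is used, so this example uses powers of the symmetric normalized adjacency (with self-loops) as successive low-pass filters and takes the differences between them as bands. The names `normalized_adjacency`, `propagate`, and `spectral_bands` are hypothetical.

```python
import math


def normalized_adjacency(edges, n):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n nodes."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loop keeps every degree positive
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(adj, x):
    """One smoothing step: multiply the node signal x by the adjacency."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in adj]


def spectral_bands(adj, x, num_bands):
    """Split x into num_bands node signals that sum back to x.

    bands[0] is the smoothest (lowest-frequency) part; later entries hold
    progressively higher-frequency residuals.
    """
    smoothed = [list(x)]
    for _ in range(num_bands - 1):
        smoothed.append(propagate(adj, smoothed[-1]))
    bands = [smoothed[-1]]  # low-pass component after repeated smoothing
    for k in range(num_bands - 2, -1, -1):
        # Difference of two smoothing levels acts as a band-pass filter.
        bands.append([s - t for s, t in zip(smoothed[k], smoothed[k + 1])])
    return bands


# Toy example: a 4-node path graph with a rapidly alternating node signal.
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_adj = normalized_adjacency(toy_edges, 4)
toy_bands = spectral_bands(toy_adj, [1.0, -1.0, 1.0, -1.0], 3)
```

Because the bands telescope, they always sum back to the input signal, so a per-band weight or mask (as in stage (ii)) can be applied before recombining. An eigendecomposition of the graph Laplacian would give sharper bands at higher cost; the polynomial form is used here only to keep the sketch dependency-free.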
NFIG: Multi-Scale Autoregressive Image Generation via Frequency Ordering
Autoregressive models have achieved significant success in image generation. However, standard autoregressive methods typically generate pixels sequentially in a fixed spatial order, which ignores the inherent hierarchical structure of image information in the spectral domain. To better leverage this spectral hierarchy, we introduce NextFrequency Image Generation (NFIG), a framework that decomposes the image generation process into multiple frequency-guided stages.
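To make the idea of frequency-guided stages concrete, the following minimal sketch builds a coarse-to-fine sequence of images by keeping progressively more Fourier coefficients. It is an illustration under stated assumptions, not the NFIG method: the abstract does not describe how stages are defined, and the radial cutoffs and the function names `dft`, `dft2`, and `frequency_stages` are hypothetical. A naive DFT is used so the example needs only the standard library; it is suitable for tiny inputs only.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive 1-D discrete Fourier transform (O(n^2))."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t, v in enumerate(seq))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def dft2(img, inverse=False):
    """Separable 2-D DFT: transform the rows, then the columns."""
    rows = [dft(r, inverse) for r in img]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def frequency_stages(img, cutoffs):
    """Return one image per cutoff, keeping frequencies with radius <= cutoff."""
    h, w = len(img), len(img[0])
    spec = dft2(img)
    stages = []
    for c in cutoffs:
        kept = [[spec[ky][kx]
                 # min(k, n - k) maps an index to its wrapped frequency.
                 if math.hypot(min(ky, h - ky), min(kx, w - kx)) <= c else 0
                 for kx in range(w)]
                for ky in range(h)]
        recon = dft2(kept, inverse=True)
        stages.append([[v.real for v in row] for row in recon])
    return stages


# A 4x4 ramp image reconstructed at three increasing cutoffs.
ramp = [[float(4 * y + x) for x in range(4)] for y in range(4)]
ramp_stages = frequency_stages(ramp, [0.0, 1.5, 3.0])
```

With a cutoff of 0 only the mean intensity survives, and once the cutoff exceeds the largest wrapped radius the original image is recovered, so each stage can be read as adding the detail that the previous one lacked.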
One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Text-guided image inpainting aims to reconstruct masked regions according to text prompts. The longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and the inpainted regions. Prior work has failed to address both at once, typically remedying only one of them. As we observe, this stems from the entanglement of hybrid (e.g., mid-and-low) frequency bands, which encode different image properties and exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting. It decomposes the semantic consistency across masked and unmasked regions into per-band consistencies while preserving the unmasked regions, thereby addressing both challenges together. We further divide the denoising process into early (high-level noise) and late (low-level noise) stages, during which the mid- and low-frequency bands are disentangled. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process. It meanwhile serves as guidance for the null-text denoising process, which denoises the low-frequency band for the masked regions. A subsequent text-guided denoising process at the late stage then achieves semantic consistency for the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions.
A.1 Frequency ablation study
We perform an ablation study on the coarse-to-fine parameter α_d and the number of frequency bands L. Figure 1 shows the surface reconstruction results for the DTU Buddha model under different frequency parameters. Each model is trained for 300K iterations. The first row shows surface reconstruction quality under different values of the coarse-to-fine parameter α_d. When the parameter is too small, the reconstructed surface tends to be oversmoothed. When it is too large, many artifacts appear in the reconstruction.
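The excerpt does not state how the coarse-to-fine parameter enters the encoding. As a purely illustrative assumption, the sketch below uses a cosine window of the kind found in coarse-to-fine positional encodings such as BARF and Nerfies, where a scalar `alpha` controls how many of the `num_bands` frequency bands are active. The names `band_weight` and `encode` are hypothetical and are not taken from the paper.

```python
import math


def band_weight(alpha, k):
    """Cosine window: 0 while alpha <= k, 1 once alpha >= k + 1."""
    t = min(max(alpha - k, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * t))


def encode(x, num_bands, alpha):
    """Windowed sinusoidal encoding of a scalar coordinate x.

    Band k uses frequency 2^k * pi and is scaled by band_weight(alpha, k),
    so a small alpha keeps only the low-frequency (smooth) features.
    """
    feats = []
    for k in range(num_bands):
        w = band_weight(alpha, k)
        feats.append(w * math.sin((2 ** k) * math.pi * x))
        feats.append(w * math.cos((2 ** k) * math.pi * x))
    return feats


# With alpha = 2.0, only the first two of four bands contribute.
example_features = encode(0.3, num_bands=4, alpha=2.0)
```

Under this assumed scheme, the trade-off reported above has a simple reading: a small value leaves the high-frequency bands switched off, which favors smooth surfaces, while a large value admits high-frequency features that can introduce artifacts.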
Supplementary Material: Revisiting Visual Model Robustness: A Frequency Long-Tailed Distribution View Zhiyu Lin
Fan et al. [2021] incorporate high-frequency views into contrastive learning, leading to the transfer …
However, there are also several works that challenge the validity of this assumption. Yin et al. [2019] propose a robustness analysis strategy based on Fourier heatmaps, which utilizes a model's sensitivity to frequency bases. Maiya et al. [2021] argue that model robustness does not have an intrinsic connection …
In addition to the perspective on frequency components, Chen et al. [2021] have shown that the CNN model should be consistent with the Human Visual System, with …
To show the power-law distribution of natural images, we select CIFAR-10 Krizhevsky et al. [2009], Tiny-ImageNet Le and Yang [2015] and ImageNet Deng et al. [2009] to conduct experiments.
We show an example of division on ImageNet in Fig. 2, in which the high- and low-frequency components of the image obtained according to the division radius are also in line with our …
We conduct experiments on naturally trained models, using the test sets of the CIFAR-10, Tiny-ImageNet, and ImageNet-1k datasets.